Abstract

In  this  paper,  we  review  the  problem  of  selecting  relevant  features  for  use  in  machine  learning.  We 
describe  this  problem  in  terms  of  heuristic  search  through  a  space  of  feature  sets,  and  we  identify 
four  dimensions  along  which  approaches  to  the  problem  can  vary.  We  consider  recent  work  on 
feature  selection  in  terms  of  this  framework,  then  close  with  some  challenges  for  future  research  in 
this  promising  area. 

1.  The  Problem  of  Irrelevant  Features 

The selection of relevant features, and the elimination of irrelevant
ones, is a central problem in machine learning. Before an induction
algorithm can move beyond the training data to make predictions about
novel test cases, it must decide which attributes to use in these
predictions and which to ignore. Intuitively, one would like the learner
to use only those attributes that are ‘relevant’ to the target concept.

There  have  been  a  few  attempts  to  define  ‘relevance’ 
in  the  context  of  machine  learning,  as  John,  Kohavi, 
and  Pfleger  (1994)  have  noted  in  their  review  of  this 
topic.  Because  we  will  review  a  variety  of  approaches, 
we  do  not  take  a  position  on  this  issue  here.  We  will 
focus  instead  on  the  task  of  selecting  relevant  features 
(however  defined)  for  use  in  learning  and  prediction. 

Many induction methods attempt to deal directly with the problem of
attribute selection, especially ones that operate on logical
representations. For instance, techniques for inducing logical
conjunctions do little more than add or remove features from the concept
description. Addition and deletion of single attributes also constitute
the basic operations of more sophisticated methods for inducing decision
lists and decision trees. Some nonlogical induction methods, like those
for neural networks and Bayesian classifiers, instead use weights to
assign degrees of relevance to attributes. And some learning schemes,
such as the simple nearest neighbor method, ignore the issue of relevance
entirely.

We would like induction algorithms that scale well to domains with many
irrelevant features. More specifically, we would like the sample
complexity (the number of training cases needed to reach a given level of
accuracy) to grow slowly with the number of irrelevant attributes.
Theoretical results for algorithms that search restricted hypothesis
spaces are encouraging. For instance, the worst-case number of errors made
by Littlestone’s (1987) Winnow method grows only logarithmically with the
number of irrelevant features. Pazzani and Sarrett’s (1992) average-case
analysis for Wholist, a simple conjunctive algorithm, and Langley and
Iba’s (1993) treatment of the naive Bayesian classifier, suggest that
their sample complexities grow at most linearly with the number of
irrelevant features.

However, the theoretical results are less optimistic for induction
methods that search a larger space of concept descriptions. For example,
Langley and Iba’s (1993) average-case analysis of simple nearest neighbor
indicates that its sample complexity grows exponentially with the number
of irrelevant attributes, even for conjunctive target concepts.
Experimental studies of nearest neighbor are consistent with this
conclusion, and other experiments suggest that similar results hold even
for induction algorithms that explicitly select features. For example,
the sample complexity for decision-tree methods appears to grow linearly
with the number of irrelevant attributes for conjunctive concepts, but
exponentially for parity concepts, since the evaluation metric cannot
distinguish relevant from irrelevant features in the latter situation
(Langley & Sage, in press).
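
The parity case can be made concrete with a small sketch. It assumes information gain as the evaluation metric (the text above does not name one) and uses a hypothetical domain of four Boolean features in which only the first two are relevant; all function and variable names are illustrative.

```python
from itertools import product
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        result -= p * log2(p)
    return result

def information_gain(rows, labels, feature):
    """Reduction in class entropy from splitting on one Boolean feature."""
    gain = entropy(labels)
    for value in (0, 1):
        subset = [y for x, y in zip(rows, labels) if x[feature] == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

# All 16 instances over four Boolean features; features 0 and 1 are
# relevant, features 2 and 3 are irrelevant (an assumed toy domain).
rows = list(product((0, 1), repeat=4))
conjunction = [x[0] & x[1] for x in rows]
parity = [x[0] ^ x[1] for x in rows]

gains_conjunction = [information_gain(rows, conjunction, f) for f in range(4)]
gains_parity = [information_gain(rows, parity, f) for f in range(4)]
```

For the conjunctive concept, the two relevant features receive positive gain and the irrelevant ones receive none. For the parity concept, every feature scores zero when considered alone, so a metric of this kind gives a greedy tree builder no basis for preferring the relevant attributes.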

Results of this sort have encouraged machine learning researchers to
explore more sophisticated methods for selecting relevant features. In
the sections that follow, we present a general framework for this task,
and then consider some recent examples of work on this important problem.

2.  Feature  Selection  as  Heuristic  Search 

One  can  view  the  task  of  feature  selection  as  a  search 
problem,  with  each  state  in  the  search  space  specifying 
a  subset  of  the  possible  features.  As  Figure  1  depicts, 
one  can  impose  a  partial  ordering  on  this  space,  with 
each  child  having  exactly  one  more  feature  than  its 
parents.  The  structure  of  this  space  suggests  that  any 
feature  selection  method  must  take  a  stance  on  four 
basic  issues  that  determine  the  nature  of  the  heuristic 
search  process. 

Figure 1. Each state in the space of feature subsets specifies the
attributes to use during induction. Note that the states in the space (in
this case involving four features) are partially ordered, with each of a
state’s children (to the right) including one more attribute (dark
circles) than its parents.


First, one must determine the starting point in the space, which in turn
determines the direction of search. For instance, one might start with no
features and successively add attributes, or one might start with all
attributes and successively remove them. The former approach is sometimes
called forward selection, whereas the latter is known as backward
elimination. One might also select an initial state somewhere in the
middle and move outward from this point.

A second decision involves the organization of the search. Clearly, an
exhaustive search of the space is impractical, as there exist $2^a$
possible subsets of $a$ attributes. A more realistic approach relies on a
greedy method to traverse the space. At each point in the search, one
considers local changes to the current set of attributes, selects one,
and then iterates, never reconsidering the choice. A related approach,
known as stepwise selection or elimination, considers both adding and
removing features at each decision point, which lets one retract an
earlier decision without keeping explicit track of the search path.
Within these options, one can consider all states generated by the
operators and then select the best, or one can simply choose the first
state that improves accuracy over the current set. One can also replace
the greedy scheme with more sophisticated methods, such as best-first
search, which are more expensive but still tractable in some domains.
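
A minimal sketch of greedy forward selection follows, under stated assumptions: the `score` argument is a hypothetical callable that maps a feature subset to a number (higher is better), and `toy_score` is an invented stand-in that rewards two relevant features and lightly penalizes the others. Neither comes from the systems reviewed here.

```python
from itertools import combinations

def forward_selection(features, score):
    """Greedy forward selection: add the best single feature until none helps.

    `score` is a hypothetical callable mapping a frozenset of features to a
    number (higher is better); a filter or wrapper measure could be plugged in.
    """
    current = frozenset()
    best = score(current)
    while True:
        # Children of the current state: one more feature than the parent.
        candidates = [current | {f} for f in features if f not in current]
        if not candidates:
            return current, best
        top = max(candidates, key=score)
        if score(top) <= best:  # no child improves on the current set: halt
            return current, best
        current, best = top, score(top)

def toy_score(subset):
    """Invented measure: +1 per relevant feature, -0.1 per irrelevant one."""
    relevant = {"a", "b"}
    return len(subset & relevant) - 0.1 * len(subset - relevant)

features = ["a", "b", "c", "d"]
selected, value = forward_selection(features, toy_score)

# Size of the full space that exhaustive search would have to visit.
n_subsets = sum(1 for k in range(len(features) + 1)
                for _ in combinations(features, k))
```

With four features the space holds 16 subsets, yet the greedy search scores at most four candidates per step. The choice made at each step is never revisited, which is the price paid for that economy.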

A third issue concerns the strategy used to evaluate alternative subsets
of attributes. One broad class of strategies considers attributes
independently of the induction algorithm that will use them, relying on
general characteristics of the training set to select some features and
exclude others. John, Kohavi, and Pfleger (1994) call these filter
methods, because they filter out irrelevant attributes before the
induction process occurs. They contrast this approach with wrapper
methods, which generate a set of candidate features, run the induction
algorithm on the training data, and use the accuracy of the resulting
description to evaluate the feature set. Within this approach, one must
still pick some estimate for accuracy, but this choice seems less central
than settling on a filter or wrapper scheme.

Finally, one must decide on some criterion for halting search through the
space of feature subsets. Within the wrapper framework, one might stop
adding or removing attributes when none of the alternatives improves the
estimate of classification accuracy, one might continue to revise the
feature set as long as accuracy does not degrade, or one might continue
generating candidate sets until reaching the other end of the search
space and then select the best. Within the filter framework, one
criterion for halting notes when each combination of values for the
selected attributes maps onto a single class value. Another alternative
simply orders the features according to some relevancy score, then uses a
system parameter to determine the break point.

Note that the above methods for feature selection can be combined with
any induction algorithm to increase its learning rate in domains with
irrelevant attributes. The effect on behavior may differ for different
induction techniques and for different target concepts, in some cases
producing little benefit and in others giving major improvement. But the
basic idea of searching the space of feature sets is conceptually and
practically distinct from the specific induction method that benefits
from the feature-selection process.

3. Recent Work on Feature Selection

The problem of feature selection has long been an active research topic
within statistics and pattern recognition (e.g., Devijver & Kittler,
1982), but most work in this area has dealt with linear regression. In
the past few years, feature selection has received considerable attention
from machine learning researchers interested in improving the performance
of their algorithms.

The earliest approaches to feature selection within machine learning
emphasized filtering methods. For example, Almuallim and Dietterich’s
(1991) Focus algorithm starts with an empty feature set and carries out
breadth-first search until it finds a minimal combination of features
that predicts pure classes. The system then passes the reduced feature
set to ID3, which constructs a decision tree to summarize the training
data. Schlimmer (1993) described a related approach that carries out a
systematic search (to avoid revisiting states) through the space of
feature sets, again starting with the empty set and adding features until
it finds a combination consistent with the training data.
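
The consistency-driven search described above can be sketched as follows. This is an illustration in the spirit of Focus rather than a reproduction of the published algorithm: it simply examines subsets in order of increasing size and returns the first one on which no two training cases agree while differing in class. The function names and the toy data are assumptions.

```python
from itertools import combinations, product

def is_consistent(rows, labels, subset):
    """True if no two cases agree on every feature in `subset` yet differ in class."""
    seen = {}
    for x, y in zip(rows, labels):
        key = tuple(x[f] for f in subset)
        if seen.setdefault(key, y) != y:
            return False
    return True

def minimal_consistent_subset(rows, labels):
    """Examine subsets by increasing size; return the first consistent one."""
    n_features = len(rows[0])
    for size in range(n_features + 1):
        for subset in combinations(range(n_features), size):
            if is_consistent(rows, labels, subset):
                return subset
    return None  # the data contain contradictory cases

# Assumed toy domain: four Boolean features, class is the XOR of features 0 and 2.
rows = list(product((0, 1), repeat=4))
labels = [x[0] ^ x[2] for x in rows]
found = minimal_consistent_subset(rows, labels)
```

Because smaller subsets are always tried first, the first consistent subset found is also a smallest one. The cost is that the number of subsets examined can grow exponentially with the size of that minimal set.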

Kira and Rendell (1992) used a quite different scheme for filtering
attributes. Their Relief algorithm assigns a weight to each feature that
reflects its ability to distinguish among the classes, then selects those
features with weights that exceed a user-specified threshold. The system
then uses ID3 to induce a decision tree from the training data using only
the selected features. Relief does not quite fit into our framework, as
it imposes a linear ordering on the features rather than searching the
partially ordered space of feature sets. Kononenko (1994) reports two
extensions to the method that handle non-Boolean attributes, and Doak
(1992) has explored similar approaches to the problem.
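
A simplified sketch of this weighting idea appears below. It is restricted to Boolean features and two classes, and it uses a common textbook form of the update (reward a feature that differs from the nearest case of the other class, penalize one that differs from the nearest case of the same class). It should not be read as the exact published procedure; the function name, the sample count, and the threshold are assumptions.

```python
import random
from itertools import product

def relief_weights(rows, labels, n_samples, rng):
    """Simplified Relief-style weighting for Boolean features and two classes."""
    n_features = len(rows[0])
    weights = [0.0] * n_features

    def distance(a, b):
        return sum(u != v for u, v in zip(a, b))

    for _ in range(n_samples):
        i = rng.randrange(len(rows))
        x, y = rows[i], labels[i]
        # Nearest case of the same class (excluding x itself) and of the other class.
        hits = [r for j, (r, c) in enumerate(zip(rows, labels)) if c == y and j != i]
        misses = [r for r, c in zip(rows, labels) if c != y]
        near_hit = min(hits, key=lambda r: distance(x, r))
        near_miss = min(misses, key=lambda r: distance(x, r))
        for f in range(n_features):
            differs_from_miss = x[f] != near_miss[f]
            differs_from_hit = x[f] != near_hit[f]
            weights[f] += (differs_from_miss - differs_from_hit) / n_samples
    return weights

# Assumed toy domain: the class equals feature 0; features 1-3 are irrelevant.
rows = list(product((0, 1), repeat=4))
labels = [x[0] for x in rows]
weights = relief_weights(rows, labels, n_samples=20, rng=random.Random(0))
threshold = 0.5  # user-specified break point (assumed value)
selected = [f for f, w in enumerate(weights) if w > threshold]
```

Note that the procedure scores each feature separately and then cuts the resulting ordering at the threshold, so no search through the space of feature subsets takes place.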

Although  Focus  and  Relief  follow  feature  selection 
with  decision-tree  construction,  one  can  also  combine 
the  former  with  other  induction  methods.  For  instance, 
Cardie  (1993)  used  a  filtering  approach  to  identify  a 
subset  of  features  for  use  in  nearest  neighbor  retrieval, 
whereas  Kubat,  Flotzinger,  and  Pfurtscheller  (1993) 
filtered  features  for  use  with  a  naive  Bayesian  classifier. 
Both  used  C4.5  to  construct  a  decision  tree  from  the 
data,  but  only  to  determine  the  features  to  be  passed 
to  their  primary  induction  methods. 

Most recent research on feature selection differs from these early
methods by relying on wrapper strategies rather than filtering schemes.
The general argument for wrapper approaches is that the induction method
that will use the feature subset should provide a better estimate of
accuracy than a separate measure that may have an entirely different
inductive bias. John, Kohavi, and Pfleger (1994) were the first to
present the wrapper idea as a general framework for feature selection.
Their own work has emphasized its combination with decision-tree methods,
but they also encourage its use with other induction algorithms.


The generic wrapper technique must still use some measure to select among
alternative features. One natural scheme involves running the induction
algorithm over the entire training data using a given set of features,
then measuring the accuracy of the learned structure on the training
data. However, John et al. argue convincingly that a cross-validation
method, which they use in their implementation, provides a better measure
of expected accuracy on novel test cases.
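
The contrast between the two accuracy measures can be sketched as follows. As an assumption made for brevity, the induction method here is simple nearest neighbor rather than the decision-tree learner that John et al. used, the folds are formed by taking every fifth case, and the toy domain has one relevant Boolean feature and three irrelevant ones.

```python
from itertools import product

def nearest_label(train_rows, train_labels, x, subset):
    """Label of the training case nearest to x, comparing only features in subset."""
    def distance(r):
        return sum((r[f] - x[f]) ** 2 for f in subset)
    best = min(range(len(train_rows)), key=lambda i: distance(train_rows[i]))
    return train_labels[best]

def resubstitution_accuracy(rows, labels, subset):
    """Accuracy measured on the same cases that were used for training."""
    hits = sum(nearest_label(rows, labels, x, subset) == y
               for x, y in zip(rows, labels))
    return hits / len(rows)

def cross_validated_accuracy(rows, labels, subset, n_folds=5):
    """Accuracy estimated by holding out each fold in turn."""
    correct = 0
    for fold in range(n_folds):
        train = [i for i in range(len(rows)) if i % n_folds != fold]
        test = [i for i in range(len(rows)) if i % n_folds == fold]
        train_rows = [rows[i] for i in train]
        train_labels = [labels[i] for i in train]
        for i in test:
            correct += nearest_label(train_rows, train_labels, rows[i], subset) == labels[i]
    return correct / len(rows)

# Assumed toy domain: the class equals feature 0; features 1-3 are irrelevant.
rows = list(product((0, 1), repeat=4))
labels = [x[0] for x in rows]
all_features = (0, 1, 2, 3)

train_acc_all = resubstitution_accuracy(rows, labels, all_features)
cv_acc_all = cross_validated_accuracy(rows, labels, all_features)
cv_acc_relevant = cross_validated_accuracy(rows, labels, (0,))
```

With the full feature set, each case is its own nearest neighbor, so accuracy on the training data is perfect and says nothing about the irrelevant attributes. The cross-validated estimate, in contrast, ranks the relevant subset above the full set. On a data set this small the exact figures depend on how ties are broken.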

John et al. also review existing definitions of relevance in the context
of machine learning and propose a new definition that overcomes some
problems with earlier ones. In addition, they describe feature selection
in terms of heuristic search and review a variety of methods that,
although designed for filter schemes, also work within the wrapper
approach. Finally, they carry out systematic experiments on a variety of
search methods within the wrapper model, varying the starting point and
the available operators.

The major disadvantage of wrapper methods relative to filter methods is
the former’s computational cost, which results from calling the induction
algorithm for each feature set considered. This cost has led some
researchers to invent ingenious techniques for speeding the evaluation
process. In particular, Caruana and Freitag (1994) devised a scheme for
caching decision trees that substantially reduces the number of trees
considered during feature selection, which in turn lets their algorithm
search larger spaces in reasonable time. Moore and Lee (1994) describe an
alternative scheme that instead speeds feature selection by reducing the
percentage of training cases used during evaluation.

Like John et al., Caruana and Freitag review a number of greedy methods
that search the space of feature sets and report on comparative
experiments that vary the starting set and the operators. However, their
concern with efficiency also led them to examine the tradeoff between
accuracy and computational cost. Moreover, their motivation for exploring
feature-selection methods was more strict than dealing with irrelevant
attributes. Their aim was to find sets of attributes that are useful for
induction and prediction.

Certainly not all work within the wrapper framework has focused on
decision-tree induction. Langley and Sage’s (1994a) Oblivion algorithm
combines the wrapper idea with the simple nearest neighbor method. Their
system starts with all features and iteratively removes the one that
leads to the greatest improvement in accuracy, continuing until the
estimated accuracy actually declines. Aha and Bankert (1994) take a
similar approach to augmenting nearest neighbor, but their system starts
with a randomly selected subset of features and includes an option for
beam search rather than greedy decisions. Skalak’s (1994) work on nearest
neighbor also starts with a random feature set, but replaces greedy
search with random hill climbing that continues for a specified number of
cycles.
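
The backward-elimination variant can be sketched as follows. This is an illustration of the general scheme rather than a reproduction of Oblivion: it assumes leave-one-out accuracy as the estimate, plain nearest neighbor as the learner, and a toy domain with one relevant Boolean feature. It also stops before removing the last feature, which is a choice made for the sketch.

```python
from itertools import product

def loo_accuracy(rows, labels, subset):
    """Leave-one-out accuracy of nearest neighbor using only features in `subset`."""
    correct = 0
    for i, x in enumerate(rows):
        others = [j for j in range(len(rows)) if j != i]
        nearest = min(others,
                      key=lambda j: sum((rows[j][f] - x[f]) ** 2 for f in subset))
        correct += labels[nearest] == labels[i]
    return correct / len(rows)

def backward_elimination(rows, labels):
    """Start with all features; drop the most helpful one until accuracy declines."""
    current = tuple(range(len(rows[0])))
    best = loo_accuracy(rows, labels, current)
    while len(current) > 1:  # sketch-level choice: never evaluate the empty set
        candidates = [tuple(f for f in current if f != drop) for drop in current]
        top = max(candidates, key=lambda s: loo_accuracy(rows, labels, s))
        top_score = loo_accuracy(rows, labels, top)
        if top_score < best:  # estimated accuracy would decline: halt
            break
        current, best = top, top_score
    return current, best

# Assumed toy domain: the class equals feature 0; features 1-3 are irrelevant.
rows = list(product((0, 1), repeat=4))
labels = [x[0] for x in rows]
selected, accuracy = backward_elimination(rows, labels)
```

Accepting removals that leave the estimate unchanged, and halting only on a decline, lets the search discard attributes that contribute nothing even when their removal yields no measurable gain.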

Table 1. Characterization of recent work on feature selection in terms of
heuristic search through the space of feature sets.

Authors (System)                Starting    Search         Evaluation  Halting
                                Point       Control        Scheme      Criterion

Aha and Bankert (Beam)          Random      Comparison     Comparison  No Better
Almuallim/Dietterich (Focus)    None        Breadth First  Filter      Consistency
Cardie                          None        Greedy         Filter      Consistency
Caruana and Freitag (CAP)       Comparison  Greedy         Wrapper     All Used
Doak                            Random      Ordering       Filter      Threshold
John, Kohavi, and Pfleger       Comparison  Greedy         Comparison  No Better
Kira and Rendell (Relief)       —           Ordering       Filter      Threshold
Kubat et al.                    None        Greedy         Filter      Consistency
Langley/Sage (Oblivion)         All         Greedy         Wrapper     Worse
Langley/Sage (Selective Bayes)  None        Greedy         Wrapper     Worse
Moore and Lee (Race)            Comparison  Greedy         Wrapper     No Better
Schlimmer                       None        Systematic     —           Consistency
Skalak                          Random      Mutation       Wrapper     Enough Times
Townsend-Weber and Kibler       All         Comparison     Wrapper     No Better

Most research on wrapper methods has focused on classification, but both
Moore and Lee (1994) and Townsend-Weber and Kibler (1994) have combined
this idea with k nearest neighbor for numeric prediction. Also, most work
has emphasized the advantages of feature selection for induction methods
that are sensitive to irrelevant features, but Langley and Sage (1994b)
have shown that the naive Bayesian classifier, which is sensitive to
redundant attributes, can benefit from the same basic approach. This
suggests that techniques for feature selection can improve the behavior
of induction algorithms in a variety of situations, not only in the
presence of irrelevant attributes.

4.  Challenges  for  Future  Research 

Despite  the  recent  activity,  and  the  associated  progress, 
in  methods  for  selecting  relevant  features,  there  remain 
many  directions  in  which  machine  learning  can  improve 
its  study  of  this  important  problem.  One  of  the  most 
urgent  involves  the  introduction  of  more  challenging 
data  sets.  Almost  none  of  the  domains  studied  to  date 
have  involved  more  than  40  features.  One  exception  is 
Aha  and  Bankert’s  study  of  cloud  classification,  which 
used  204  attributes,  but  typical  experiments  have  dealt 
with  many  fewer  features. 

Moreover, Langley and Sage’s results with the nearest neighbor method
suggest that many of the UCI data sets have few if any irrelevant
attributes. In hindsight, this seems natural for diagnostic domains, in
which experts tend to ask about relevant features and ignore other ones.
However, we believe that many real-world domains do not have this
character, and that we must find data sets with a substantial fraction of
irrelevant attributes if we want to test our ideas on feature selection
adequately.

Experiments with artificial data also have important roles to play in the
study of feature-selection methods. Such data sets can let one
systematically vary factors of interest, such as the number of relevant
and irrelevant attributes, while holding other factors constant. In this
way, one can directly measure the sample complexity of algorithms as a
function of these factors, showing their ability to scale to domains with
many irrelevant features. However, we distinguish between the use of
artificial data for such systematic experiments and reliance on isolated
artificial data sets (such as the Monks problems), which seem much less
useful.
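
As one way to set up such a systematic experiment, the sketch below generates Boolean data whose class is a conjunction of the first few features, with the number of irrelevant features as the factor being varied. The generator name, the sample size, and the particular values of the factor are assumptions chosen for illustration.

```python
import random

def make_conjunctive_data(n_cases, n_relevant, n_irrelevant, rng):
    """Boolean cases whose class is the conjunction of the first n_relevant features."""
    rows, labels = [], []
    for _ in range(n_cases):
        x = tuple(rng.randint(0, 1) for _ in range(n_relevant + n_irrelevant))
        rows.append(x)
        labels.append(int(all(x[:n_relevant])))
    return rows, labels

# Hold the target concept and sample size fixed; vary only the number of
# irrelevant features (0, 4, and 16 are arbitrary illustrative settings).
rng = random.Random(0)
datasets = {k: make_conjunctive_data(200, 2, k, rng) for k in (0, 4, 16)}
```

Running a learner on each data set at several training-set sizes would then trace out its learning curve as a function of the number of irrelevant attributes.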

More challenging domains, with more features and a higher proportion of
irrelevant ones, will require more sophisticated methods for feature
selection. Although further increases in efficiency would increase the
number of states examined, such constant-factor improvements cannot
eliminate problems caused by exponential growth in the number of feature
sets. However, viewing these problems in terms of heuristic search
suggests some places to look for solutions. In general, we must:

• invent more intelligent techniques for selecting an initial set of
  features from which to start the search;

• formulate search-control methods that take advantage of structure in
  the space of feature sets;

• devise improved frameworks (better even than the wrapper method) for
  evaluating the usefulness of alternative feature sets;

• design better halting criteria that will improve efficiency without
  sacrificing useful feature sets.

Naturally, the details of these extensions remain to be discovered, but
each holds significant potential for increasing the ability of selection
methods to handle realistic domains with many irrelevant features.

Future research in the area should also compare feature selection to
attribute-weighting schemes. In the limit, attribute weighting should
outperform selection in domains that involve different degrees of
relevance, but the introduction of weights also increases the number of
hypotheses considered during induction, which can slow learning. Thus,
each approach has some advantages, leaving an open question that is best
answered by experiment, but preferably by informed experiments designed
to test specific hypotheses about these two approaches to relevance.
Resolving such basic issues promises to keep the field of machine
learning occupied for many years to come.